refactor: Compact schema_generation to <150 LOC#855

Closed
MQ37 wants to merge 1 commit into
fix/dataset-schema-merge-empty-shapesfrom
refactor/schema-generation-under-150-loc

@MQ37 MQ37 commented May 16, 2026

Contributor

Context

Reviewer ceiling for src/utils/schema_generation.ts + tests is 150 LOC of code (comments / blank lines free). The fix in #854 lands at 387.

Solution

Compact both files to 148 combined LOC with the same behavior. Stacked on top of #854 so the diff in this PR is purely the compression — no behavior change to review here.

| File | #854 | This PR |
| --- | --- | --- |
| src/utils/schema_generation.ts | 122 | 69 |
| tests/unit/schema_generation.test.ts | 265 | 79 |
| Combined | 387 | 148 |

Worth your attention

  • Coverage equivalent, fewer test blocks. 28 → 21 tests. The 6 type-union cases and 5 format-detection cases are folded into two it.each tables. Same regression coverage: NYC sushi case, heterogeneous keys, format false-positive guard.
  • Two ! non-null assertions added in merge. Both inside branches guarded by &&/|| predicates immediately above (ap[k] && bp[k] and a.items || b.items). The narrowing alternative was 4 LOC longer.
  • Internal types dropped, public surface unchanged. JsonSchemaArray, JsonSchemaObject, SchemaGenerationOptions, and the removeEmptyArrays export were all unused outside the module. JsonSchemaProperty stays (imported by actor_execution.ts).
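
The trade-off behind the two `!` assertions can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Schema` shape, `mergeStub`, and both function names are hypothetical, and only the `a.items || b.items` guard is taken from the description above.

```typescript
// Hypothetical minimal shape; the real JsonSchemaProperty likely has more fields.
type Schema = { type?: string; items?: Schema };

// Stand-in for the recursive merge (assumption): it simply prefers the left side.
const mergeStub = (x: Schema, _y: Schema): Schema => x;

// Compact form: the `||` guard guarantees that at least one side has `items`,
// but the compiler cannot carry that fact into `a.items ?? b.items`, hence the `!`.
function itemsCompact(a: Schema, b: Schema): Schema | undefined {
  if (!(a.items || b.items)) return undefined;
  const picked: Schema = a.items && b.items ? mergeStub(a.items, b.items) : (a.items ?? b.items)!;
  return picked;
}

// Narrowed form: no assertion needed, at the cost of a few extra lines.
function itemsNarrowed(a: Schema, b: Schema): Schema | undefined {
  if (a.items && b.items) return mergeStub(a.items, b.items);
  if (a.items) return a.items;
  if (b.items) return b.items;
  return undefined;
}
```

Both functions return the same result for every input; the difference is only whether the guarantee lives in the guard above the expression or in the control flow itself.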


Same behavior as the previous commit (set-union merge, type-array
unions, format detection, NYC regression coverage). Tightening:

- Inline `inferType` into `infer`.
- Drop `JsonSchemaArray` / `JsonSchemaObject` types (redundant with
  `JsonSchemaProperty`).
- Drop `SchemaGenerationOptions` type (inlined into signature).
- Tests use `it.each` tables for the 6 type-union cases and 5
  format-detection cases; 28 → 21 tests, coverage equivalent.

src/utils/schema_generation.ts:   122 → 69 LOC
tests/unit/schema_generation.test.ts: 265 → 79 LOC
Combined: 387 → 148 LOC

@jirispilka jirispilka left a comment

Collaborator


As you might have guessed already, I prefer the human readable version :D

MQ37 commented May 21, 2026

Contributor Author

@jirispilka ahh, these humans - closing then 😄

@MQ37 MQ37 closed this May 21, 2026
MQ37 added a commit that referenced this pull request May 21, 2026
## Context

`get-dataset-schema`, `call-actor`, and `get-actor-run` all share
`generateSchemaFromItems`, which delegated to `to-json-schema@0.2.5`.
Any two items with different key sets collapsed to
`{type:'array',items:{type:'object'}}` — properties wiped out. Reported
on a real NYC restaurants dataset where ~half the items carried
`markdown` and half didn't.

## Solution

Replaced the library with an in-house inferrer in
`src/utils/schema_generation.ts`. The merge does a set-union of property
keys and recurses; primitive type conflicts emit JSON Schema `type`
arrays (e.g. `["string","null"]`). Drops the `arrayMode` field from
`get-dataset-schema` — it only existed as a workaround for the buggy
`mode:'all'`, and all internal callers were already passing it anyway.
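
A rough sketch of the merge described above, under stated assumptions: the `Schema` type and the `toTypes`, `merge`, and `infer` helpers are illustrative names and shapes, not the contents of `src/utils/schema_generation.ts`. It shows only the two ideas named in this section, a set-union of property keys with recursion and `type` arrays for primitive conflicts.

```typescript
// Hypothetical minimal schema shape for this sketch; the repo's
// JsonSchemaProperty type may carry more fields.
type Schema = {
  type?: string | string[];
  properties?: Record<string, Schema>;
  items?: Schema;
};

// Normalize a `type` keyword (missing, single, or array) to a list.
function toTypes(t: Schema['type']): string[] {
  if (t === undefined) return [];
  return Array.isArray(t) ? t : [t];
}

// Merge two inferred schemas: union the types, union the property keys, recurse.
function merge(a: Schema, b: Schema): Schema {
  const types: string[] = [];
  for (const t of toTypes(a.type).concat(toTypes(b.type))) {
    if (types.indexOf(t) === -1) types.push(t); // first-seen order, no duplicates
  }
  const out: Schema = {};
  if (types.length === 1) out.type = types[0];
  else if (types.length > 1) out.type = types;

  if (a.properties || b.properties) {
    const ap = a.properties ?? {};
    const bp = b.properties ?? {};
    const props: Record<string, Schema> = {};
    for (const k of Object.keys(ap).concat(Object.keys(bp))) {
      const x: Schema | undefined = ap[k];
      const y: Schema | undefined = bp[k];
      // A key present on only one side is kept instead of being dropped.
      props[k] = x && y ? merge(x, y) : ((x ?? y) as Schema);
    }
    out.properties = props;
  }

  if (a.items && b.items) out.items = merge(a.items, b.items);
  else if (a.items) out.items = a.items;
  else if (b.items) out.items = b.items;
  return out;
}

// Infer a schema for one JSON value (assumes plain JSON input: no functions, no undefined).
function infer(value: unknown): Schema {
  if (value === null) return { type: 'null' };
  if (Array.isArray(value)) {
    const schemas: Schema[] = value.map((v: unknown) => infer(v));
    return schemas.length > 0
      ? { type: 'array', items: schemas.reduce((acc, cur) => merge(acc, cur)) }
      : { type: 'array' };
  }
  if (typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const properties: Record<string, Schema> = {};
    for (const k of Object.keys(obj)) properties[k] = infer(obj[k]);
    return { type: 'object', properties };
  }
  if (typeof value === 'number') return { type: Math.floor(value) === value ? 'integer' : 'number' };
  return { type: typeof value }; // 'string' or 'boolean' for JSON input
}
```

With this sketch, `infer([{ name: 'a', markdown: '# hi' }, { name: 'b' }])` keeps both `name` and `markdown` under `items.properties`, which is the heterogeneous-key case the old library collapsed.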

## Worth your attention

- **No external dependency, no supply-chain surface.** `to-json-schema`
was last published in 2020 and the upstream repo is dead. Owning ~120
LOC of pure JSON-Schema inference is cheaper than auditing an
unmaintained transitive surface on a server that handles customer Apify
tokens.
- **Type-array unions for primitive conflicts.** `{x:1}` + `{x:"hi"}`
produces `{"type":["integer","string"]}` — spec-valid JSON Schema,
handled natively by LLMs reading the tool output. Verified the generated
schema is never Ajv-validated downstream (checked both this repo and
`apify-mcp-server-internal` — Ajv only validates tool *input* args).
- **`arrayMode` field removed from `get-dataset-schema`.** Technically a
public API change. Safe because (a) all 3 internal callers always passed
`arrayMode:'all'`, and (b) the `'first'` mode was never useful —
`to-json-schema` applies it recursively to nested arrays too, which is
almost never what callers want.
- **Drops the upstream's `format:"style"` false positive.** Free-form
Markdown text was being tagged with a CSS-ish format. The new format
detector covers only `uri`, `date-time`, `date`, `email`, `uuid` — the
unambiguous ones.
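
As an illustration of a deliberately narrow format detector, here is a minimal sketch. The five format names come from the bullet above; the `detectFormat` name and every regular expression are assumptions made for this example and are not taken from the repository.

```typescript
// Formats named in the PR description; everything else here is hypothetical.
type Format = 'uri' | 'date-time' | 'date' | 'email' | 'uuid';

// Fully anchored patterns (illustrative assumptions): a string must match
// as a whole, so free-form text that merely contains a URL or date is ignored.
const FORMAT_PATTERNS: Array<[Format, RegExp]> = [
  ['uuid', /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i],
  ['date-time', /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/],
  ['date', /^\d{4}-\d{2}-\d{2}$/],
  ['email', /^[^\s@]+@[^\s@]+\.[^\s@]+$/],
  ['uri', /^https?:\/\/\S+$/i],
];

// Return the first matching format, or undefined when nothing matches.
function detectFormat(value: string): Format | undefined {
  for (const [format, pattern] of FORMAT_PATTERNS) {
    if (pattern.test(value)) return format;
  }
  return undefined;
}
```

Because nothing in the table resembles a catch-all, a Markdown body such as `'# Menu\n\nSee https://example.com'` gets no `format` at all in this sketch, rather than a misleading one.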

## Follow-up

- **#855 stacks a compact rewrite on top of this PR — same behavior, 387
→ 148 combined code LOC.** Merge order: this PR first, then #855. Or
squash-merge #855 alone to replace this. This PR ships the verbose,
easy-to-read version for review clarity.

---------

Co-authored-by: Jiří Spilka <jiri.spilka@apify.com>
jirispilka added a commit that referenced this pull request May 26, 2026